Encoding High-Cardinality String Categorical Variables

Authors

Abstract

Statistical models usually require vector representations of categorical variables, using for instance one-hot encoding. This strategy breaks down when the number of categories grows, as it creates high-dimensional feature vectors. Additionally, for string entries, one-hot encoding does not capture information in their representation. Here, we seek low-dimensional encoding of high-cardinality string categorical variables. Ideally, these should be: scalable to many categories; interpretable to end users; and facilitate statistical analysis. We introduce two encoding approaches for string categories: a Gamma-Poisson matrix factorization on substring counts, and the min-hash encoder, for fast approximation of string similarities. We show that min-hash turns set inclusions into inequality relations that are easier to learn. Both approaches are scalable and streamable. Experiments on real and simulated data show that these methods improve supervised learning with high-cardinality categorical variables. We recommend the following: if scalability is central, the min-hash encoder is the best option as it does not require any data fit; if interpretability is important, the Gamma-Poisson factorization is the best alternative, as it can be interpreted as one-hot encoding on inferred categories with informative feature names. Both models enable autoML on the original string entries as they remove the need for feature engineering or data cleaning.
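The claim that min-hash turns set inclusions into inequality relations can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`char_ngrams`, `minhash_encode`), the use of character 3-grams with space padding, and the salted BLAKE2b hash are all assumptions chosen for illustration. Each output dimension is the minimum of one hash function over the string's n-gram set, so if the n-grams of one string are contained in those of another, every component of the larger string's encoding is less than or equal to the corresponding component of the smaller one.

```python
import hashlib


def char_ngrams(s, n=3):
    """Return the set of character n-grams of s, padded with one space per side."""
    padded = f" {s} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def _hash(gram, seed):
    """Deterministic 64-bit hash of an n-gram, salted by an integer seed (assumed scheme)."""
    digest = hashlib.blake2b(
        gram.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_encode(s, dim=8, n=3):
    """Encode s as dim integers: the minimum of each salted hash over its n-grams."""
    grams = char_ngrams(s, n)
    # default=0 covers strings too short to contain any n-gram
    return [min((_hash(g, seed) for g in grams), default=0) for seed in range(dim)]


short = "senior"
longer = "senior engineer"

# The n-grams of the shorter string are a subset of those of the longer one ...
is_subset = char_ngrams(short) <= char_ngrams(longer)

# ... so taking the minimum over the larger set can only lower each component.
dominated = all(
    a >= b for a, b in zip(minhash_encode(short), minhash_encode(longer))
)
```

No fitting step is involved: the encoding of a string depends only on the string and the fixed hash functions, which is consistent with the abstract's remark that the min-hash encoder does not require any data fit.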

Similar resources

Importance of Variables Semantic in CNF Encoding of Cardinality Constraints

In the satisfiability domain, it is well known that a SAT algorithm may solve one problem instance easily and another with difficulty, even though the two instances are equivalent CNF encodings of the same original problem. Moreover, different algorithms may disagree on which encoding makes the problem easier to solve. In this paper, we focus on the CNF encoding of cardinality constraints, which states t...

Categorical Variables in DEA

If a DEA model has a mix of categorical and continuous variables, a standard LP formulation can still be used by entering all combinations of categorical and continuous variables as different types of inputs and/or outputs. Most units will then not have positive levels of all variables. The implications for selection of peers are investigated. Peers can have the same or fewer types of inputs tha...

Multiple Unordered Categorical Dependent Variables in Organizational Research

A model for analyzing multiple categorical dependent variables is presented and developed for use in organizational research. A primary example occurs in the foreign market entry literature, where choice of ownership (majority, equal, or minority) and “function” (acquisition or joint venture) are simultaneously endogenous; only separate univariate ownership-based and function-based choice model...

Structural Equation Modeling: Categorical Variables

In the behavioral sciences, response variables are often noncontinuous, common types being dichotomous, ordinal or nominal variables, counts and durations. Conventional structural equation models (SEMs) have thus been generalized to accommodate different kinds of responses.

Encoding Cardinality Constraints using Generalized Selection Networks

Boolean cardinality constraints state that at most (at least, or exactly) k out of n propositional literals can be true. We propose a new class of selection networks that can be used for an efficient encoding of them. Several comparator networks have been proposed recently for encoding cardinality constraints and experiments have proved their efficiency. Those were based mainly on the odd-even ...

Journal

Journal title: IEEE Transactions on Knowledge and Data Engineering

Year: 2022

ISSN: 1558-2191, 1041-4347, 2326-3865

DOI: https://doi.org/10.1109/tkde.2020.2992529